Add predicate pushdown (updated) #78

mike-luabase · 2024-10-16T13:39:19Z

This PR enhances query performance by filtering data at the metadata level, reducing the amount of data read during scans. It currently only addresses queries containing a single Iceberg table.

Key Changes

Extended IcebergManifestEntry:
- Added lower_bounds and upper_bounds maps to store column statistics.
Utility Function:
- Implemented IcebergUtils::GetFullPath to resolve file paths accurately.
Metadata Retrieval:
- Added GetEntries template in IcebergTable to fetch relevant manifest entries, excluding deleted ones.
Predicate Evaluation:
- Created EvaluatePredicateAgainstStatistics to assess if data files satisfy query predicates based on their statistics.
- For each predicate:
  - Identifies the column involved.
  - Checks if the column has defined lower and upper bounds.
  - Based on the comparison type (e.g., =, >, <), determines if the predicate can be satisfied given the file's bounds.
  - If any predicate fails, the file is excluded from the scan.
Scan Expression Modification:
- Updated MakeScanExpression to filter data_file_entries using predicates before scanning.
Binding Function Enhancements:
- Enhanced IcebergScanBindReplace with additional logging and prepared data files based on predicate results.

This implementation uses manifest metadata to skip data files that cannot match the query predicates before they are read, which should reduce scan time and I/O for selective queries.
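
For readers who want a concrete picture of the predicate evaluation steps listed above, here is a minimal sketch of min/max pruning. It is not the code from this PR: the `Predicate`, `FileStats`, `MightMatch`, and `FileMightMatch` names are hypothetical, and it assumes the bounds have already been decoded into 64-bit integers keyed by column id (Iceberg stores them as serialized bytes).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified types for illustration only.
enum class CompareOp { Equal, GreaterThan, LessThan };

struct Predicate {
    int32_t column_id; // column the predicate refers to
    CompareOp op;      // comparison type
    int64_t constant;  // constant the column is compared against
};

struct FileStats {
    // Assumption: bounds already decoded to int64, keyed by column id.
    std::unordered_map<int32_t, int64_t> lower_bounds;
    std::unordered_map<int32_t, int64_t> upper_bounds;
};

// Returns false only when the bounds prove that no row in the file can match.
bool MightMatch(const FileStats &stats, const Predicate &pred) {
    auto lo = stats.lower_bounds.find(pred.column_id);
    auto hi = stats.upper_bounds.find(pred.column_id);
    if (lo == stats.lower_bounds.end() || hi == stats.upper_bounds.end()) {
        return true; // no statistics for this column: the file cannot be pruned
    }
    assert(lo->second <= hi->second);
    switch (pred.op) {
    case CompareOp::Equal:
        return lo->second <= pred.constant && pred.constant <= hi->second;
    case CompareOp::GreaterThan:
        return hi->second > pred.constant;
    case CompareOp::LessThan:
        return lo->second < pred.constant;
    }
    return true; // unknown comparison: keep the file to stay correct
}

// A file is scanned only if every predicate might be satisfied.
bool FileMightMatch(const FileStats &stats, const std::vector<Predicate> &preds) {
    for (const auto &pred : preds) {
        if (!MightMatch(stats, pred)) {
            return false;
        }
    }
    return true;
}
```

The sketch is deliberately conservative: a missing bound or an unsupported comparison keeps the file in the scan, so pruning can only drop files that provably contain no matching rows.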

Mytherin · 2024-10-16T13:44:08Z

src/iceberg_functions/iceberg_scan.cpp

-	select_statement->node = std::move(select_node);
-	return make_uniq<SubqueryRef>(std::move(select_statement), "iceberg_scan");
+    vector<Value> structs;
+    for (const auto &file : data_file_values) {


The majority of lines changed here are still formatting changes - could you format using our clang-format config?

@Mytherin fixed the formatting, sorry about that!

samansmink · 2024-10-16T13:47:06Z

Also @mike-luabase, you don't need to reopen the PR every time, this makes it harder to review actually since we lose a clear view on the review comments that were made before.

If you want to overwrite commits, you can just force push to your feature branch.

mike-luabase · 2024-10-16T14:10:53Z

Also @mike-luabase, you don't need to reopen the PR every time, this makes it harder to review actually since we lose a clear view on the review comments that were made before.

Yes, understood, won't do that again.

mike-luabase · 2024-10-16T15:28:51Z

This test is failing; I'm looking into it:

SELECT count(*) FROM ICEBERG_SCAN('data/iceberg/generated_spec2_0_001/pyspark_iceberg_table');
Lower bounds is null
Upper bounds is null
Binder Error: Table "iceberg_scan_data" does not have a column named "filename"

mike-luabase · 2024-10-16T21:02:55Z

@Mytherin test failure should be fixed now

mike-luabase · 2024-10-17T09:27:46Z

@samansmink @Mytherin can we trigger the run again? This error doesn't seem related to my changes:

#7 [ 3/14] RUN apt-get install -y -qq software-properties-common
#7 0.232 E: Unable to locate package software-properties-common
#7 ERROR: process "/bin/sh -c apt-get install -y -qq software-properties-common" did not complete successfully: exit code: 100
------
 > [ 3/14] RUN apt-get install -y -qq software-properties-common:
0.232 E: Unable to locate package software-properties-common
------
Dockerfile:9
--------------------
   7 |     # Setup the basic necessities
   8 |     RUN apt-get update -y -qq
   9 | >>> RUN apt-get install -y -qq software-properties-common
  10 |     RUN apt-get install -y -qq --fix-missing ninja-build make gcc-multilib g++-multilib libssl-dev wget openjdk-8-jdk zip maven unixodbc-dev libc6-dev-i386 lib32readline6-dev libssl-dev libcurl4-gnutls-dev libexpat1-dev gettext unzip build-essential checkinstall libffi-dev curl libz-dev openssh-client pkg-config autoconf
  11 |     RUN apt-get install -y -qq ccache
--------------------
ERROR: failed to solve: process "/bin/sh -c apt-get install -y -qq software-properties-common" did not complete successfully: exit code: 100

catkins · 2024-10-18T01:22:49Z

@mike-luabase 💚 I'm so excited to see movement on this! I'll be able to delete a bunch of manual partition pruning + read_parquet code in an app of mine!

mike-luabase · 2024-10-18T13:57:18Z

@samansmink anything I can do to help here?

samansmink · 2024-10-18T14:17:03Z

If you could re-add a PR description and address @Mytherin's comment from one of your previous PRs #75 (comment), I will take a more detailed look to review this.

Please be considerate that reviews take a lot of time, especially reviews of PRs from outside contributors that touch complicated code. To get your PRs through as quickly as possible, I recommend that you:

- Double-check your ideas for a PR with a core contributor of DuckDB through either the DuckDB Discord channel, a GitHub issue, or a GitHub discussion (see https://github.com/duckdb/duckdb/blob/main/CONTRIBUTING.md).
- Make sure that your PRs are easy to review:
  - no big PRs
  - don't add big format changes to your PRs
  - write a clear description of what you are adding and why you chose the path you chose

mike-luabase · 2024-10-20T12:57:03Z

@samansmink I added the description and fixed the formatting!

I understand this PR is on the larger side, but I think it's as small as it could be to get predicate pushdown working.

I'm more than happy to discuss the PR! I'll drop a note in the Discord.

Mytherin

Thanks for reducing the changeset - this is easier to review. I still wonder about the changes to the Avro files. In addition, there are no tests included with the changeset. Could you add tests that stress test the predicate pushdown in various ways (e.g. testing out different predicate types, different data types, etc.)?

Mytherin · 2024-10-20T16:36:25Z

src/include/avro_codegen/iceberg_manifest_entry_partial.hpp


+/* This code was generated by avrogencpp 1.13.0-SNAPSHOT. Do not edit.*/
+
+#ifndef MANIFEST_ENTRY_HH_2043678367_H


Could you expand a bit on why the avro/iceberg_types files are being changed?

@Mytherin the existing types didn't have a way to access lower_bounds and upper_bounds, which we need for pushdown.

mike-luabase · 2024-10-21T15:04:29Z

on it!

mike-luabase · 2024-10-25T18:20:42Z

@Mytherin I added tests and updated the data generation for predicate pushdown. Let me know if that works!

mike-luabase · 2024-10-27T12:26:27Z

I see that files from /data/iceberg have somehow made it back into the PR. I'm trying to remove them, but I'm not having much luck.

samansmink · 2024-11-05T13:05:18Z

As mentioned before, @mike-luabase why is this PR removing the checked in test data?

peterboncz

Thanks for the efforts, Mike. I was looking through it, but I would not call it a review, and in any case I think the DuckDB Labs folks need to do that. I marked a number of diffs in the PR that I think could be removed: the deletion of the generated files (as already remarked by Sam) and some smaller things.

peterboncz · 2024-10-25T09:29:14Z

src/include/iceberg_types.hpp

+				lower_bounds[std::to_string(lb.key)] = lb.value;
+			}
+		} else {
+			fprintf(stderr, "Lower bounds ISSSSS null\n");


ISSSSS?

Is printing to stderr the right way to signal the absence of bounds info in a manifest?
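
One possible alternative, sketched under assumptions rather than taken from this PR: treat absent bounds as an empty map and let the caller interpret "no entry" as "this file cannot be pruned", so nothing needs to be printed. `BoundEntry`, `BoundsMap`, and `CollectBounds` are hypothetical names, and `std::optional` stands in for whatever nullable type the Avro-generated code actually uses.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one decoded (field id, serialized value) pair.
struct BoundEntry {
    int32_t key;
    std::string value;
};

using BoundsMap = std::unordered_map<std::string, std::string>;

// Builds the bounds map, keyed by the stringified field id as in the diff above.
// Absent bounds yield an empty map instead of a message on stderr.
BoundsMap CollectBounds(const std::optional<std::vector<BoundEntry>> &bounds) {
    BoundsMap result;
    if (!bounds) {
        return result; // no statistics recorded for this manifest entry
    }
    for (const auto &entry : *bounds) {
        result[std::to_string(entry.key)] = entry.value;
    }
    return result;
}
```

With this shape, a missing bound is an ordinary state that the pruning code handles by keeping the file, rather than a condition that needs to be reported.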

peterboncz · 2024-11-12T18:14:25Z

scripts/test_data_generator/generate_iceberg.py

@@ -1,4 +1,4 @@
-#!/usr/bin/python3
+#!/Users/mritchie712/opt/anaconda3/bin/python


This will break on other computers. In this file there are some useful changes (partitioning) but also spurious changes like semicolons and spacing. It would be nice to remove the spurious changes to make the diff as small as possible.

peterboncz · 2024-11-12T18:19:13Z

src/common/iceberg.cpp

@@ -80,8 +80,8 @@ vector<IcebergManifestEntry> IcebergTable::ReadManifestEntries(const string &pat
 		}
 	} else {
 		auto schema = avro::compileJsonSchemaFromString(MANIFEST_ENTRY_SCHEMA);
-		avro::DataFileReader<c::manifest_entry> dfr(std::move(stream), schema);
-		c::manifest_entry manifest_entry;
+		avro::DataFileReader<manifest_entry> dfr(std::move(stream), schema);


These changes are spurious, and hence this file change could be removed from the PR.

mike-luabase and others added 2 commits October 15, 2024 16:00

add predicate pushdown

b96686e

add predicate pushdown

f0c99b5

mike-luabase changed the title ~~Add predicate pushdown4~~ Add predicate pushdown (updated) Oct 16, 2024

mike-luabase mentioned this pull request Oct 16, 2024

Add predicate pushdown #75

Closed

Mytherin reviewed Oct 16, 2024

formatting

2711c8b

mike-luabase added 2 commits October 16, 2024 16:59

fixed filename issue

3eb21f1

fixed filename issue

fc52c3e

mike-luabase added 2 commits October 20, 2024 08:51

updated formatting

5b5929c

updated formatting

72a4af9

Mytherin reviewed Oct 21, 2024

mike-luabase added 2 commits October 25, 2024 12:46

added pred push tests

8c9e68b

added pred push tests2

91e9ddf

mike-luabase added 3 commits October 25, 2024 14:35

fixed formatting, removed /data

cc3d46a

removed /data

6b1440f

Force remove tracked files from data/iceberg

d9600d4

mike-luabase force-pushed the add-predicate-pushdown4 branch 3 times, most recently from eb8c753 to 136b5eb Compare October 27, 2024 12:12

mike-luabase force-pushed the add-predicate-pushdown4 branch from 136b5eb to d9600d4 Compare October 27, 2024 12:24

peterboncz reviewed Nov 12, 2024
